Framing the problem¶

The goal of this report is to assist a local car dealership in improving their business by accurately predicting car prices. By using historical car sale data, we will develop a predictive model that estimates the selling price of cars based on various attributes. Additionally, the report will explore other valuable insights from the data to help the dealership make informed business decisions.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px
plotly.offline.init_notebook_mode()

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

Getting the data¶

For this report, we will be using this used-car dataset from Kaggle.

This dataset covers eight years of used-car inventory in Canada. Each day, the provider's systems crawl and aggregate inventory from over 65k dealer websites to deliver a comprehensive, up-to-date picture of market activity.

Individual listing records show year, make, model and trim, with VIN-level histories tracing each car from its most recent online appearance back to its earliest, along with every change that occurred over that time.

Equipment breakdowns give fuel type, engine size, transmission, color, driveline and body style, with links back to the vehicle detail page (VDP) of the source we obtained the information from.

In [ ]:
df = pd.read_csv('../data/ca-dealers-used.csv')
df.head(2)
C:\Users\Owner\AppData\Local\Temp\ipykernel_11404\18213597.py:1: DtypeWarning:

Columns (13,15) have mixed types. Specify dtype option on import or set low_memory=False.

Out[ ]:
id vin price miles stock_no year make model trim body_type ... drivetrain transmission fuel_type engine_size engine_block seller_name street city state zip
0 b39ea795-eca9 19UNC1B01HY800062 179999.0 9966.0 V-P4139 2017.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V edmundston honda 475 Rue Victoria Edmundston NB E3V 2K7
1 026cb5b1-6e3e 19UNC1B02HY800023 179995.0 5988.0 PPAP70374 2017.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V garage daniel lessard 2795 Route-du-prsident-kennedy Notre-dame-des-pins QC G0M 1K0

2 rows × 21 columns

As the warning above shows, columns 13 and 15 (fuel_type and engine_block) contain mixed types. We will reload the dataset, specifying the dtypes for those columns.

In [ ]:
dtype_spec = {
    'fuel_type': 'object',
    'engine_block': 'object'
}
In [ ]:
df = pd.read_csv('../data/ca-dealers-used.csv', dtype=dtype_spec)
df.head(2)
Out[ ]:
id vin price miles stock_no year make model trim body_type ... drivetrain transmission fuel_type engine_size engine_block seller_name street city state zip
0 b39ea795-eca9 19UNC1B01HY800062 179999.0 9966.0 V-P4139 2017.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V edmundston honda 475 Rue Victoria Edmundston NB E3V 2K7
1 026cb5b1-6e3e 19UNC1B02HY800023 179995.0 5988.0 PPAP70374 2017.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V garage daniel lessard 2795 Route-du-prsident-kennedy Notre-dame-des-pins QC G0M 1K0

2 rows × 21 columns

In [ ]:
df.shape
Out[ ]:
(393603, 21)

We've sorted that out and we can see from the above that our dataset consists of 393,603 rows and 21 columns, which we will explore in the next section.

Exploratory Data Analysis¶

In [ ]:
df.head(5)
Out[ ]:
id vin price miles stock_no year make model trim body_type ... drivetrain transmission fuel_type engine_size engine_block seller_name street city state zip
0 b39ea795-eca9 19UNC1B01HY800062 179999.0 9966.0 V-P4139 2017.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V edmundston honda 475 Rue Victoria Edmundston NB E3V 2K7
1 026cb5b1-6e3e 19UNC1B02HY800023 179995.0 5988.0 PPAP70374 2017.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V garage daniel lessard 2795 Route-du-prsident-kennedy Notre-dame-des-pins QC G0M 1K0
2 5cd5d5b2-5cc2 19UNC1B02HY800071 168528.0 24242.0 B21085 2017.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V lougheed acura 1388 Lougheed Highway Coquitlam BC V3K 6S4
3 b32473ed-5922 19UNC1B02LY800001 220000.0 6637.0 AP5333 2020.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V drive autogroup 1305 Parkway Suite 600 Pickering ON L1V 3P2
4 ac40c9fc-0676 19UNC1B02LY800001 220000.0 6637.0 AP5333 2020.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V acura pickering 575 Kingston Road Pickering ON L1V 3N7

5 rows × 21 columns

Overview of the Dataset¶

In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 393603 entries, 0 to 393602
Data columns (total 21 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            393603 non-null  object 
 1   vin           393603 non-null  object 
 2   price         358486 non-null  float64
 3   miles         366590 non-null  float64
 4   stock_no      365910 non-null  object 
 5   year          393586 non-null  float64
 6   make          393603 non-null  object 
 7   model         388809 non-null  object 
 8   trim          354824 non-null  object 
 9   body_type     359578 non-null  object 
 10  vehicle_type  355365 non-null  object 
 11  drivetrain    354608 non-null  object 
 12  transmission  357922 non-null  object 
 13  fuel_type     322790 non-null  object 
 14  engine_size   320950 non-null  float64
 15  engine_block  320439 non-null  object 
 16  seller_name   391371 non-null  object 
 17  street        385674 non-null  object 
 18  city          385829 non-null  object 
 19  state         385767 non-null  object 
 20  zip           385834 non-null  object 
dtypes: float64(4), object(17)
memory usage: 63.1+ MB

From the above, we can see all 21 columns along with their dtypes and non-null counts.

In [ ]:
df.describe()
Out[ ]:
price miles year engine_size
count 3.584860e+05 3.665900e+05 393586.000000 320950.000000
mean 2.601902e+04 7.566339e+04 2016.414829 2.785073
std 2.064007e+04 5.775442e+04 3.345400 1.236639
min 0.000000e+00 0.000000e+00 1981.000000 0.600000
25% 1.490000e+04 3.491375e+04 2015.000000 2.000000
50% 2.190000e+04 6.232800e+04 2017.000000 2.400000
75% 3.199500e+04 1.032830e+05 2019.000000 3.500000
max 1.288888e+06 2.300033e+06 2022.000000 8.400000

Of the 21 columns, only 4 are numeric; the remaining 17 are non-numeric.

In [ ]:
df.isnull().sum()
Out[ ]:
id                  0
vin                 0
price           35117
miles           27013
stock_no        27693
year               17
make                0
model            4794
trim            38779
body_type       34025
vehicle_type    38238
drivetrain      38995
transmission    35681
fuel_type       70813
engine_size     72653
engine_block    73164
seller_name      2232
street           7929
city             7774
state            7836
zip              7769
dtype: int64

We have a lot of missing values in our dataset. Looking at the price column, for example, there are 35,117 missing values, about 10% of the rows. Because the dataset is large, we can afford to drop every row with a missing price.

In [ ]:
import plotly.express as px

# Calculate the percentage of missing values for each column
missing_values = df.isnull().mean() * 100

# Filter out columns with no missing values
missing_values = missing_values[missing_values > 0]

# Create a larger pie chart
fig = px.pie(values=missing_values, names=missing_values.index, title='Missing Values Percentage by Column', width=800, height=800)
fig.show()

We will decide what to do with the rest of the columns as we go on.

But for now, we will remove all the empty rows in the price column from the dataset.

In [ ]:
df = df.dropna(subset=['price'])
df['price'].isnull().sum()
Out[ ]:
0

Let us look at our price column.

In [ ]:
# visualize the distribution of the target variable (price) with a plotly histogram
fig = px.histogram(df, x='price', title='Price Distribution')
fig.show()

The histogram is heavily right-skewed because of outliers in the column. Let us use a box plot to examine this further.

In [ ]:
fig = px.box(df, x='price', title='Price Distribution')
fig.show()

Looking at the distribution of the price, some cars are listed as high as $1.2M; unless these are very luxurious cars, those entries are probably mistakes. Let us look at the cars priced above $100k.

In [ ]:
df[df['price'] > 100000]
Out[ ]:
id vin price miles stock_no year make model trim body_type ... drivetrain transmission fuel_type engine_size engine_block seller_name street city state zip
0 b39ea795-eca9 19UNC1B01HY800062 179999.0 9966.0 V-P4139 2017.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V edmundston honda 475 Rue Victoria Edmundston NB E3V 2K7
1 026cb5b1-6e3e 19UNC1B02HY800023 179995.0 5988.0 PPAP70374 2017.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V garage daniel lessard 2795 Route-du-prsident-kennedy Notre-dame-des-pins QC G0M 1K0
2 5cd5d5b2-5cc2 19UNC1B02HY800071 168528.0 24242.0 B21085 2017.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V lougheed acura 1388 Lougheed Highway Coquitlam BC V3K 6S4
3 b32473ed-5922 19UNC1B02LY800001 220000.0 6637.0 AP5333 2020.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V drive autogroup 1305 Parkway Suite 600 Pickering ON L1V 3P2
4 ac40c9fc-0676 19UNC1B02LY800001 220000.0 6637.0 AP5333 2020.0 Acura NSX Base Coupe ... 4WD Automatic Electric / Premium Unleaded 3.5 V acura pickering 575 Kingston Road Pickering ON L1V 3N7
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
387560 75485435-0109 ZPBUA1ZLXKLA03998 259000.0 10287.0 904888 2019.0 Lamborghini Urus Base SUV ... 4WD Automatic Premium Unleaded 4.0 V hgrgoire 625 Rue Dubois Saint-eustache QC J7P 3W1
389789 4c9c3f8f-f6d3 4JGFB5KB8MA310456 100177.0 90.0 21028102 2021.0 Mercedes-Benz GLE GLE450 SUV ... 4WD Automatic Electric / Unleaded 3.0 I mercedes-benz north vancouver 1375 Marine Drive North Vancouver BC V7P 3E5
393547 1b392ca8-b5c8 W1KUG7DB1LA529608 120790.0 4995.0 220351 2020.0 Mercedes-Benz S-Class S560e Sedan ... RWD Automatic Electric / Unleaded 3.0 V mercedes-benz sherbrooke 4787, Boul. Bourque Sherbrooke QC J1N 2G6
393601 efa897c2-a52f WP1BE2AY4LDA55072 115995.0 NaN NaN 2020.0 Porsche Cayenne Coupe E-Hybrid SUV ... 4WD Automatic Electric / Unleaded 3.0 V porsche centre winnipeg 660 Pembina Highway Winnipeg MB R3M 2M5
393602 479607ed-62af WP1BE2AY8LDA55284 113530.0 7595.0 P16107 2020.0 Porsche Cayenne Coupe E-Hybrid SUV ... 4WD Automatic Electric / Unleaded 3.0 V h.j. pfaff motors 33 Auto Park Cir Woodbridge ON L4L 8R1

1840 rows × 21 columns

Perusing the listings priced above $100k, I can see some Porsches and Lamborghinis, which are nice, but since this report is for a small local dealership, we will exclude these cars so they do not skew our dataset.

In [ ]:
df.shape
Out[ ]:
(358486, 21)
In [ ]:
df = df[df['price'] <= 100000]
df.shape
Out[ ]:
(356646, 21)
In [ ]:
fig = px.histogram(df, x='price', title='Price Distribution')
fig.show()
In [ ]:
fig = px.box(df, x='price', title='Price Distribution')
fig.show()

We can see that the price distribution is now far less skewed, with the median car price coming in at around $22k.

Let us take care of the rest of the missing values in the dataset.

In [ ]:
df.isnull().sum()
Out[ ]:
id                  0
vin                 0
price               0
miles           22414
stock_no        23142
year               15
make                0
model            4182
trim            33175
body_type       30610
vehicle_type    34539
drivetrain      33155
transmission    30371
fuel_type       62233
engine_size     64009
engine_block    64353
seller_name      2042
street           6601
city             6564
state            6624
zip              6561
dtype: int64
In [ ]:
numeric_df = df.select_dtypes(include=[np.number])
numeric_df.columns
Out[ ]:
Index(['price', 'miles', 'year', 'engine_size'], dtype='object')
In [ ]:
non_numeric_df = df.select_dtypes(exclude=[np.number])
non_numeric_df.columns
Out[ ]:
Index(['id', 'vin', 'stock_no', 'make', 'model', 'trim', 'body_type',
       'vehicle_type', 'drivetrain', 'transmission', 'fuel_type',
       'engine_block', 'seller_name', 'street', 'city', 'state', 'zip'],
      dtype='object')
In [ ]:
# for all columns in the non-numeric dataframe, print the number of unique values
for col in non_numeric_df.columns:
    display(f'{col}: {non_numeric_df[col].nunique()}')
    
'id: 356646'
'vin: 204955'
'stock_no: 175130'
'make: 55'
'model: 850'
'trim: 1462'
'body_type: 21'
'vehicle_type: 2'
'drivetrain: 3'
'transmission: 2'
'fuel_type: 24'
'engine_block: 3'
'seller_name: 4980'
'street: 4805'
'city: 758'
'state: 15'
'zip: 3523'
In [ ]:
# number of missing values in the non-numeric dataframe
non_numeric_df.isnull().sum()
Out[ ]:
id                  0
vin                 0
stock_no        23142
make                0
model            4182
trim            33175
body_type       30610
vehicle_type    34539
drivetrain      33155
transmission    30371
fuel_type       62233
engine_block    64353
seller_name      2042
street           6601
city             6564
state            6624
zip              6561
dtype: int64

For our categorical variables, we will replace all the missing values with a placeholder value, 'Missing'.

In [ ]:
# for all columns in the non-numeric dataframe, replace missing values with 'Missing' in df
for col in non_numeric_df.columns:
    df[col] = df[col].fillna('Missing')
    
# check if there are any missing values in the non-numeric dataframe
df[non_numeric_df.columns].isnull().sum()
Out[ ]:
id              0
vin             0
stock_no        0
make            0
model           0
trim            0
body_type       0
vehicle_type    0
drivetrain      0
transmission    0
fuel_type       0
engine_block    0
seller_name     0
street          0
city            0
state           0
zip             0
dtype: int64

For the numeric columns, we will replace the missing values with the median of each respective column.

In [ ]:
numeric_df.isnull().sum()
Out[ ]:
price              0
miles          22414
year              15
engine_size    64009
dtype: int64
In [ ]:
# for all columns in the numeric dataframe, replace missing values with the median value in df
for col in numeric_df.columns:
    df[col] = df[col].fillna(df[col].median())
In [ ]:
df.isnull().sum()
Out[ ]:
id              0
vin             0
price           0
miles           0
stock_no        0
year            0
make            0
model           0
trim            0
body_type       0
vehicle_type    0
drivetrain      0
transmission    0
fuel_type       0
engine_size     0
engine_block    0
seller_name     0
street          0
city            0
state           0
zip             0
dtype: int64

We have no more missing values in our dataset, now we can proceed with our analysis.

In [ ]:
numeric_df = df.select_dtypes(include=[np.number])
non_numeric_df = df.select_dtypes(exclude=[np.number])

Correlation Analysis¶

In [ ]:
import seaborn as sns

correlation_matrix = numeric_df.corr()
plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

We can see that engine size and model year have a moderate positive correlation with price, while miles has a negative correlation with price.

What this hints at is that these features most likely have an effect on the price of a car.

In [ ]:
# using plotly to show the scatter plot of the relationship between the price and all the columns in the numeric dataframe except price
for col in numeric_df.columns:
    if col != 'price':
        fig = px.scatter(df, x=col, y='price', title=f'{col} vs Price')
        fig.show()

The above graphs give us a glimpse into how these features correlate with price.

Looking at price vs. miles, although the plot is very congested, you can see that prices gradually decline as mileage increases.

Similarly, in price vs. year, the top-left part of the graph is sparse, which tells us that older cars rarely command high prices unless they are vintage or luxury cars.

Engine size follows a similar trend, though it is interesting that engine sizes of 7 and 8 do not necessarily translate into higher prices.

Feature Engineering¶

Here, we will play around with some of our features to either get more features or drop some features that we believe are redundant. We will also normalize, scale and encode our features to get them ready for our modelling.


In [ ]:
df.isnull().sum()
Out[ ]:
id              0
vin             0
price           0
miles           0
stock_no        0
year            0
make            0
model           0
trim            0
body_type       0
vehicle_type    0
drivetrain      0
transmission    0
fuel_type       0
engine_size     0
engine_block    0
seller_name     0
street          0
city            0
state           0
zip             0
dtype: int64
In [ ]:
df_copy = df.copy()
In [ ]:
df_copy.isnull().sum()
Out[ ]:
id              0
vin             0
price           0
miles           0
stock_no        0
year            0
make            0
model           0
trim            0
body_type       0
vehicle_type    0
drivetrain      0
transmission    0
fuel_type       0
engine_size     0
engine_block    0
seller_name     0
street          0
city            0
state           0
zip             0
dtype: int64

We will replace the year of manufacture of the car with the age of the car.

In [ ]:
# create an age column by subtracting the year of manufacture from the current year (2024)
df_copy['age'] = 2024 - df_copy['year']

df_copy = df_copy.drop(columns=['year'])
df_copy.head(5)
Out[ ]:
id vin price miles stock_no make model trim body_type vehicle_type ... transmission fuel_type engine_size engine_block seller_name street city state zip age
16 a7b53aaa-ae99 1B3LC56R18N652133 3499.0 174850.0 Missing Dodge Avenger SXT Sedan Car ... Automatic E85 2.7 V carview motor 1113 Finch Ave W North York ON M3J 2E5 16.0
17 acdf10ba-fa76 1B3LC56R18N660054 5200.0 89124.0 215140 Dodge Avenger SXT Sedan Car ... Automatic E85 2.7 V strickland's brantford chevrolet buick gmc cad... 19-21 Lynden Road Brantford ON N3R 8B8 16.0
18 3ec82d01-aec0 1B3LC56R38N195597 4789.0 151745.0 30479 Dodge Avenger SXT Sedan Car ... Automatic E85 2.7 V dale wurfel chrysler dodge jeep ltd. 28478 Centre Road Strathroy ON N7G 3J2 16.0
19 19cab0ff-35f0 1B3LC56R58N274804 4601.0 252955.0 KH6268B Dodge Avenger SXT Sedan Car ... Automatic E85 2.7 V medicine hat toyota 1400 Strachan Road Southeast Medicine Hat AB T1B 4V2 16.0
20 d128f381-0f23 1B3LC56R88N566740 5995.0 176631.0 2008DAVGWHT Dodge Avenger SXT Sedan Car ... Automatic E85 2.7 V first edmonton auto 8303 118 Avenue Northwest Edmonton AB T5B 0S4 16.0

5 rows × 21 columns

We will also group the different fuel type classes we have into these 4 classes below.

In [ ]:
gasoline = {'gas', 
            'E85 / Unleaded',
       'Unleaded', 'Premium Unleaded',
       'Premium Unleaded; Unleaded', 
       'Unleaded; Unleaded / E85', 'Unleaded / E85',
       'E85 / Unleaded; Unleaded', 'Premium Unleaded / Unleaded',
       'E85 / Premium Unleaded; E85 / Unleaded', 
       'E85', 'E85 / Premium Unleaded', 
       'Compressed Natural Gas; Unleaded',
       'E85 / Unleaded; Unleaded / Unleaded',
       'Diesel / Premium Unleaded', 'E85 / Unleaded; Unleaded / E85',
       'Unleaded / Unleaded', 
       'Compressed Natural Gas / Unleaded', 'Diesel; Unleaded',
       'Diesel; E85 / Unleaded', 'E85 / Unleaded; Premium Unleaded',
       'Premium Unleaded; Premium Unleaded / E85', 'E85; E85 / Unleaded',
       'Unleaded / Premium Unleaded',
       'Premium Unleaded / E85',
       'M85 / Unleaded'
}

diesel = {
    'Biodiesel'
}

hybrid = {
       'Electric / Premium Unleaded', 
       'Electric / Unleaded',
       'Unleaded / Electric',
       'Electric / Hydrogen',
       'Electric / Premium Unleaded; Electric / Unleaded',
       'Electric / Premium Unleaded; Premium Unleaded',
       'Electric / E85'
}

other = {
    'Hydrogen', 
    'Premium Unleaded / Natural Gas',
    'Compressed Natural Gas / Lpg', 
    'Compressed Natural Gas', 'Propane',
    'Flex Fuel Vehicle',
}
In [ ]:
def fuel_preprocess(x):
    if x in gasoline:
        return 'gasoline'
    if x in diesel:
        return 'diesel'
    if x in hybrid:
        return 'hybrid'
    if x in other:
        return 'other'
    return x
In [ ]:
df_copy['fuel_type'] = df_copy['fuel_type'].apply(fuel_preprocess)
df_copy['fuel_type'] = df_copy['fuel_type'].apply(lambda x : x.lower())

Let us drop the identifier and seller-location columns, which we will certainly not be using in our analysis.

In [ ]:
# drop the identifier and seller-location columns
df_copy = df_copy.drop(columns=['id', 'street', 'city', 'state', 'vin', 'stock_no', 'zip', 'seller_name'])
df_copy.head(5)
Out[ ]:
price miles make model trim body_type vehicle_type drivetrain transmission fuel_type engine_size engine_block age
16 3499.0 174850.0 Dodge Avenger SXT Sedan Car FWD Automatic gasoline 2.7 V 16.0
17 5200.0 89124.0 Dodge Avenger SXT Sedan Car FWD Automatic gasoline 2.7 V 16.0
18 4789.0 151745.0 Dodge Avenger SXT Sedan Car FWD Automatic gasoline 2.7 V 16.0
19 4601.0 252955.0 Dodge Avenger SXT Sedan Car FWD Automatic gasoline 2.7 V 16.0
20 5995.0 176631.0 Dodge Avenger SXT Sedan Car FWD Automatic gasoline 2.7 V 16.0

Next, we will scale our numeric columns so that the model does not attribute higher importance to features simply because they are on a larger scale. We use RobustScaler, which centres on the median and scales by the interquartile range, making it robust to the outliers we saw earlier.

In [ ]:
#scale the numeric columns
scaler = RobustScaler()
#get the numeric columns
numeric_df = df_copy.select_dtypes(include=[np.number])
df_copy[numeric_df.columns] = scaler.fit_transform(df_copy[numeric_df.columns])

df_copy.head(5)
Out[ ]:
price miles make model trim body_type vehicle_type drivetrain transmission fuel_type engine_size engine_block age
16 -1.080117 1.861594 Dodge Avenger SXT Sedan Car FWD Automatic gasoline 0.230769 V 2.25
17 -0.980206 0.452114 Dodge Avenger SXT Sedan Car FWD Automatic gasoline 0.230769 V 2.25
18 -1.004347 1.481709 Dodge Avenger SXT Sedan Car FWD Automatic gasoline 0.230769 V 2.25
19 -1.015389 3.145772 Dodge Avenger SXT Sedan Car FWD Automatic gasoline 0.230769 V 2.25
20 -0.933510 1.890877 Dodge Avenger SXT Sedan Car FWD Automatic gasoline 0.230769 V 2.25

We will split our dataset into features and target for further analysis.

In [ ]:
X = df_copy.drop(columns=['price'])
y = df_copy['price']

Next, a very important step: we encode our categorical features, since our models accept only numeric values, not strings. We use target encoding, which replaces each category with a statistic of the target (price) for that category.

In [ ]:
import category_encoders as ce
In [ ]:
encoder = ce.TargetEncoder(cols=X.select_dtypes(include='object').columns)

X_encoded = encoder.fit_transform(X, y)
X_encoded.head(5)
Out[ ]:
miles make model trim body_type vehicle_type drivetrain transmission fuel_type engine_size engine_block age
16 1.861594 -0.029734 -0.858752 -0.275362 -0.213189 -0.16285 -0.343581 0.238349 0.19941 0.230769 0.701507 2.25
17 0.452114 -0.029734 -0.858752 -0.275362 -0.213189 -0.16285 -0.343581 0.238349 0.19941 0.230769 0.701507 2.25
18 1.481709 -0.029734 -0.858752 -0.275362 -0.213189 -0.16285 -0.343581 0.238349 0.19941 0.230769 0.701507 2.25
19 3.145772 -0.029734 -0.858752 -0.275362 -0.213189 -0.16285 -0.343581 0.238349 0.19941 0.230769 0.701507 2.25
20 1.890877 -0.029734 -0.858752 -0.275362 -0.213189 -0.16285 -0.343581 0.238349 0.19941 0.230769 0.701507 2.25
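As a toy illustration of the idea (the notebook's `ce.TargetEncoder` additionally smooths each category mean toward the global mean), naive target encoding on made-up data replaces each category with the mean target value observed for it:

```python
import pandas as pd

# Naive (unsmoothed) target encoding on made-up data:
# each category is replaced by the mean price observed for that category.
toy = pd.DataFrame({'make': ['Honda', 'Honda', 'Ford', 'Ford', 'Ford'],
                    'price': [20000, 24000, 30000, 34000, 38000]})
means = toy.groupby('make')['price'].mean()
toy['make_encoded'] = toy['make'].map(means)
print(means.to_dict())  # {'Ford': 34000.0, 'Honda': 22000.0}
```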

Our dataset is now in a good shape for us to proceed.

Splitting the Data¶

We will split our dataset into training, validation and test sets. We do this because we want to compare several models and need a separate dataset to validate them on before evaluating the chosen model on the test set.

80% for training, and 10% each for validation and testing.

In [ ]:
X_train, X_temp, y_train, y_temp = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

X_valid, X_test, y_valid, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

X_train.shape, X_valid.shape, X_test.shape
Out[ ]:
((285316, 12), (35665, 12), (35665, 12))

Before we proceed to modelling: we currently have 12 feature columns. To reduce complexity and training time, we want to keep only the important features.

We will be employing the Random Forest Regressor to do this.

In [ ]:
from sklearn.ensemble import RandomForestRegressor

# Create a Random Forest regressor
rf_reg = RandomForestRegressor()

# Fit the Random Forest model on the training data
rf_reg.fit(X_train, y_train)

# Get feature importance scores
feature_importance = rf_reg.feature_importances_

# Get feature names
feature_names = X_train.columns

# Create a dictionary of feature names and their importance scores
feature_importance_dict = dict(zip(feature_names, feature_importance))

# Sort feature importance in descending order
sorted_feature_importance = sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True)

# Print feature importance in descending order
print("Feature Importance:")
for feature, importance in sorted_feature_importance:
    print(f"{feature}: {importance}")

# Plot feature importance
plt.figure(figsize=(10, 6))
plt.barh(range(len(feature_importance)), [importance for feature, importance in sorted_feature_importance], align='center')
plt.yticks(range(len(feature_importance)), [feature for feature, importance in sorted_feature_importance])
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("Feature Importance")
plt.show()
Feature Importance:
model: 0.563868077246823
age: 0.26741747815328754
trim: 0.0528600239072123
miles: 0.049257555585983034
make: 0.022834418905164137
engine_size: 0.019899818092682497
body_type: 0.007356206619016892
drivetrain: 0.006261621163743797
fuel_type: 0.003528214692941907
vehicle_type: 0.002744793371402196
engine_block: 0.002356242756622099
transmission: 0.0016155495051207779
In [ ]:
#plot the feature importance using plotly
fig = px.bar(x=[feature for feature, importance in sorted_feature_importance], y=[importance for feature, importance in sorted_feature_importance], title='Feature Importance')
fig.update_xaxes(title_text='Feature')
fig.update_yaxes(title_text='Importance')
fig.show()
In [ ]:
important_features = [feature for feature, importance in sorted_feature_importance if importance > 0.01]
important_features
Out[ ]:
['model', 'age', 'trim', 'miles', 'make', 'engine_size']

From the above, we can see that using only the model, age, trim, mileage, make and engine_size of a car, we should be able to predict the price fairly accurately without the rest of the information.

Modelling¶

We will be using only these features for our modelling.

In [ ]:
X_train = X_train[important_features]
X_valid = X_valid[important_features]
X_test = X_test[important_features]

Performance on the Validation set¶

Linear Regression¶

In [ ]:
from sklearn.linear_model import LinearRegression

model_lr = LinearRegression()
model_lr.fit(X_train, y_train)
y_pred_train_lr = model_lr.predict(X_train)
y_pred_valid_lr = model_lr.predict(X_valid)
In [ ]:
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, r2_score
In [ ]:
r2_train_lr = r2_score(y_train, y_pred_train_lr)
r2_valid_lr = r2_score(y_valid, y_pred_valid_lr)

mae_train_lr = mean_absolute_error(y_train, y_pred_train_lr)
mae_valid_lr = mean_absolute_error(y_valid, y_pred_valid_lr)

mape_train_lr = mean_absolute_percentage_error(y_train, y_pred_train_lr)
mape_valid_lr = mean_absolute_percentage_error(y_valid, y_pred_valid_lr)

metrics_df_lr = pd.DataFrame({'r2': [r2_train_lr, r2_valid_lr],
                            'mae': [mae_train_lr, mae_valid_lr],
                            'mape': [mape_train_lr, mape_valid_lr]},
                           index=['train', 'valid'])
metrics_df_lr
Out[ ]:
r2 mae mape
train 0.811052 0.254411 1.325142e+12
valid 0.815736 0.253285 1.198900e+12

Random Forest Regressor¶

In [ ]:
from sklearn.ensemble import RandomForestRegressor

model_rfr = RandomForestRegressor()
model_rfr.fit(X_train, y_train)

y_pred_train_rfr = model_rfr.predict(X_train)
y_pred_valid_rfr = model_rfr.predict(X_valid)
In [ ]:
r2_train_rfr = r2_score(y_train, y_pred_train_rfr)
r2_valid_rfr = r2_score(y_valid, y_pred_valid_rfr)

mae_train_rfr = mean_absolute_error(y_train, y_pred_train_rfr)
mae_valid_rfr = mean_absolute_error(y_valid, y_pred_valid_rfr)

mape_train_rfr = mean_absolute_percentage_error(y_train, y_pred_train_rfr)
mape_valid_rfr = mean_absolute_percentage_error(y_valid, y_pred_valid_rfr)

metrics_df_rfr = pd.DataFrame({'r2': [r2_train_rfr, r2_valid_rfr],
                            'mae': [mae_train_rfr, mae_valid_rfr],
                            'mape': [mape_train_rfr, mape_valid_rfr]},
                           index=['train', 'valid'])
metrics_df_rfr
Out[ ]:
r2 mae mape
train 0.992056 0.036285 1.705072e+11
valid 0.965906 0.082242 3.623365e+11

Gradient Boosting Regressor¶

In [ ]:
from sklearn.ensemble import GradientBoostingRegressor

model_gbr = GradientBoostingRegressor()
model_gbr.fit(X_train, y_train)

y_pred_train_gbr = model_gbr.predict(X_train)
y_pred_valid_gbr = model_gbr.predict(X_valid)
In [ ]:
r2_train_gbr = r2_score(y_train, y_pred_train_gbr)
r2_valid_gbr = r2_score(y_valid, y_pred_valid_gbr)

mae_train_gbr = mean_absolute_error(y_train, y_pred_train_gbr)
mae_valid_gbr = mean_absolute_error(y_valid, y_pred_valid_gbr)

mape_train_gbr = mean_absolute_percentage_error(y_train, y_pred_train_gbr)
mape_valid_gbr = mean_absolute_percentage_error(y_valid, y_pred_valid_gbr)

#create a dataframe to store the metrics for the GradientBoostingRegressor model
metrics_df_gbr = pd.DataFrame({'r2': [r2_train_gbr, r2_valid_gbr],
                            'mae': [mae_train_gbr, mae_valid_gbr],
                            'mape': [mape_train_gbr, mape_valid_gbr]},
                           index=['train', 'valid'])
metrics_df_gbr
Out[ ]:
r2 mae mape
train 0.894189 0.187816 1.063444e+12
valid 0.895918 0.186801 1.027996e+12

Of the three models above, the Random Forest Regressor performed the best on both the R-squared value and the mean absolute error, on both the training and validation sets.

A point to note: the extremely high MAPE (mean absolute percentage error) values above result from dividing by target values at or near zero (the scaled prices are centred around zero), which inflates the percentage; it does not indicate a problem with the models.
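A minimal sketch, with made-up numbers, of how a single near-zero target inflates MAPE even though the absolute errors are identical:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Same absolute error of 1.0 on every point in both cases,
# but one target near zero makes MAPE explode.
y_true_ok = np.array([10.0, 20.0, 30.0])
y_true_tiny = np.array([1e-9, 20.0, 30.0])  # first target is ~0
y_pred = np.array([11.0, 21.0, 31.0])

mape_ok = mean_absolute_percentage_error(y_true_ok, y_pred)
mape_tiny = mean_absolute_percentage_error(y_true_tiny, y_pred)
print(mape_ok)    # modest value
print(mape_tiny)  # enormous, driven entirely by the division by ~0
```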

In [ ]:
# compare the r2 score and the mean absolute error of the three models
metrics_df = pd.concat([metrics_df_lr, metrics_df_rfr, metrics_df_gbr], axis=0, keys=['Linear Regression', 'Random Forest', 'Gradient Boosting'])
metrics_df
Out[ ]:
r2 mae mape
Linear Regression train 0.811052 0.254411 1.325142e+12
valid 0.815736 0.253285 1.198900e+12
Random Forest train 0.992056 0.036285 1.705072e+11
valid 0.965906 0.082242 3.623365e+11
Gradient Boosting train 0.894189 0.187816 1.063444e+12
valid 0.895918 0.186801 1.027996e+12

We will proceed to use the Random Forest Regressor Model on our test dataset.

Performance on the test set¶

Random Forest Regressor¶

In [ ]:
y_pred_test_rfr = model_rfr.predict(X_test)

r2_test_rfr = r2_score(y_test, y_pred_test_rfr)
mae_test_rfr = mean_absolute_error(y_test, y_pred_test_rfr)

# create a dataframe to store the metrics for the Random Forest Regressor model
metrics_df_test_rfr = pd.DataFrame({'r2': r2_test_rfr, 'mae': mae_test_rfr}, index=['test'])
metrics_df_test_rfr
Out[ ]:
r2 mae
test 0.967028 0.082672

We have an R-squared value of 0.967 on the test set, meaning the model explains about 96.7% of the variance in price; the closer this value is to 1, the better the fit.
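One caveat: since the price column was itself transformed with RobustScaler, the MAE of 0.0827 above is in scaled units, not dollars. A minimal sketch, with made-up prices, of how a scaled MAE can be converted back using the fitted scaler's `scale_` attribute (MAE scales linearly with the data, and centring does not affect it):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import mean_absolute_error

# Made-up prices standing in for the real price column.
prices = np.array([[5000.0], [15000.0], [22000.0], [32000.0], [90000.0]])
scaler = RobustScaler().fit(prices)
scaled = scaler.transform(prices)

# Pretend predictions that are off by 0.1 scaled units everywhere.
preds_scaled = scaled.ravel() + 0.1
mae_scaled = mean_absolute_error(scaled.ravel(), preds_scaled)

# scale_ holds the IQR used for scaling, so multiplying recovers dollars.
mae_dollars = mae_scaled * scaler.scale_[0]
print(mae_scaled, mae_dollars)  # ~0.1 scaled units, ~$1700 for these toy prices
```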

Conclusion¶

After our analysis, we arrived at a model that predicts the market price of a car with a test R-squared of 0.967, and we identified the features that most strongly determine a car's price.

Next Steps¶

Build an interactive UI component where end users can input a car's specifications and receive a predicted price.
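A minimal sketch of the helper such a UI could call under the hood. The pipeline, column names and toy data here are hypothetical stand-ins for the trained model and features above, and a one-hot encoder replaces the target encoder for simplicity:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data; column names mirror the important features found above.
train = pd.DataFrame({
    'model': ['Civic', 'Civic', 'F-150', 'F-150'],
    'make': ['Honda', 'Honda', 'Ford', 'Ford'],
    'age': [3, 8, 2, 10],
    'miles': [40000, 120000, 20000, 160000],
})
prices = [22000, 12000, 45000, 18000]

# Encode the categorical columns and pass the numeric ones through unchanged.
pipeline = Pipeline([
    ('encode', ColumnTransformer(
        [('cat', OneHotEncoder(handle_unknown='ignore'), ['model', 'make'])],
        remainder='passthrough')),
    ('rf', RandomForestRegressor(random_state=42)),
])
pipeline.fit(train, prices)

def predict_price(specs: dict) -> float:
    """Return the predicted price for a single car specification."""
    return float(pipeline.predict(pd.DataFrame([specs]))[0])

print(predict_price({'model': 'Civic', 'make': 'Honda', 'age': 4, 'miles': 50000}))
```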